posted 02-13-2008 12:14 AM
Finally - this has been tormenting me for a while. Any suggestions are welcome, including criticism, concerns, moans, groans, and gripes.
-----------------------------------------
A Brief Response to Inaccuracies in Zelicoff's Monte Carlo Simulation of Polygraph Accuracy Estimates
Zelicoff (2007), while critical of Raskin and Honts' (2002) use of unweighted averaging of polygraph accuracy estimates, committed two far more serious and misleading errors in his Monte Carlo estimates of polygraph accuracy rates. First, Zelicoff misrepresented the normal range of two standard deviations surrounding his Monte Carlo mean estimates as the 95% confidence interval, instead of correctly calculating the standard errors of the estimate as would be indicated in any undergraduate statistics text. Zelicoff's estimates of positive predictive value (PPV) and negative predictive value (NPV) are therefore grossly inaccurate and misleading to readers seeking to understand the potential accuracy of polygraph decisions in field practice. Second, Zelicoff, though he acknowledges that inconclusive test results are a matter of reality in polygraph testing and other testing contexts, has chosen to recast inconclusive test results as decision errors under the guise of “errors of ambiguity” (pg. 8).
Confidence intervals are correctly calculated as the range of two standard errors around the mean, not two standard deviations. It is a foundation of undergraduate statistics that the standard error of an estimate is calculated as the standard deviation of the estimate divided by the square root of N. Zelicoff could easily have calculated the correct standard errors and confidence intervals using the N of the Monte Carlo sample space. The result of this error is that Zelicoff's reported confidence intervals are unrealistically wide, and cannot be considered an accurate representation of what polygraph program managers might expect from field examiners. What Zelicoff has referred to as a confidence interval is nothing more than the normal range of results, which occur within two standard deviations of the mean, and which he has also calculated erroneously due to the second fatal error in his Monte Carlo simulation.
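A minimal sketch of the difference, assuming a hypothetical Monte Carlo sample of accuracy estimates (the mean and spread below are invented for illustration, not taken from Zelicoff's data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical Monte Carlo sample of 10,000 simulated accuracy estimates;
# the mean and spread are invented for illustration, not Zelicoff's values.
n = 10_000
estimates = rng.normal(loc=0.89, scale=0.06, size=n)

mean = estimates.mean()
sd = estimates.std(ddof=1)      # standard deviation of the simulated estimates
se = sd / np.sqrt(n)            # standard error of the mean estimate

# What Zelicoff labeled a 95% confidence interval: the normal range of
# individual simulated results (mean +/- 2 SD).
normal_range = (mean - 2 * sd, mean + 2 * sd)

# The actual ~95% confidence interval for the mean estimate (mean +/- 2 SE).
conf_interval = (mean - 2 * se, mean + 2 * se)

print(f"mean:                    {mean:.4f}")
print(f"normal range (+/- 2 SD): {normal_range[0]:.4f} to {normal_range[1]:.4f}")
print(f"95% CI (+/- 2 SE):       {conf_interval[0]:.4f} to {conf_interval[1]:.4f}")
```

With 10,000 draws, the interval built from standard errors is one hundred times narrower (the square root of 10,000) than the two-standard-deviation normal range, which is the scale of the distortion at issue.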
While it is important for polygraph professionals and consumers of polygraph test results to be aware of both Bayesian and frequentist models of accuracy and error estimation, it is equally important to recognize that, while neither method is sufficient to completely address the range of questions regarding all the facets of test accuracy estimation that concern researchers and test developers, there are inherent differences in the generalizability of any accuracy estimates derived from Bayesian and frequentist models. Simple false positive (FP) and false negative (FN) rates are frequentist calculations, derived from within uniform sets or subsets of deceptive or truthful subjects; their complements are recognized as the sensitivity and specificity rates. However, the actual calculation of FP, FN, sensitivity, and specificity rates is not that simple, because of the ternary decision scheme that is inherent to field polygraph testing. Field polygraph decisions include an inconclusive zone that can be expected to reduce the occurrence of decision errors, while at the same time causing the condition in which the sum of sensitivity plus FN, or specificity plus FP, does not equal one (1) as Zelicoff would like to assert. The advantage of these frequentist estimates, which are calculated within the set of deceptive or truthful subjects, is that it is straightforward to study their generalizability from one sample or population to another, regardless of the sample size or the base rates of truthful and deceptive cases. While FP and FN rates are robust against changes in sample size and base rates, assuming a sample large enough to provide a continuous distribution of potential result values, the Bayesian estimates of positive predictive value (PPV) and negative predictive value (NPV) are calculated not within a uniform sample space but through a matrix that includes both deceptive and truthful subjects. Bayesian PPV and NPV estimates are therefore not robust against differences in the base rates or proportions of the truthful and deceptive subsets. Despite the known poor generalizability of Bayesian estimates to situations with unknown base rates, or base rates that vary substantially from the laboratory or development models, Zelicoff would have us use those non-robust estimates instead of the more suitably generalizable frequentist estimates of sensitivity and specificity.
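The base-rate point can be made concrete with a few lines of arithmetic; the sensitivity, specificity, and error rates below are hypothetical placeholders, not estimates from any particular polygraph study:

```python
# Hypothetical frequentist rates under a ternary (positive/negative/
# inconclusive) scheme; note sensitivity + FN rate < 1 and
# specificity + FP rate < 1 because of the inconclusive zone.
sensitivity = 0.86   # P(deceptive result | deceptive subject)
fn_rate     = 0.06   # P(truthful result  | deceptive subject)
specificity = 0.80   # P(truthful result  | truthful subject)
fp_rate     = 0.08   # P(deceptive result | truthful subject)

def ppv_npv(base_rate: float) -> tuple[float, float]:
    """Bayesian predictive values for a given base rate of deception."""
    ppv = (sensitivity * base_rate) / (
        sensitivity * base_rate + fp_rate * (1 - base_rate))
    npv = (specificity * (1 - base_rate)) / (
        specificity * (1 - base_rate) + fn_rate * base_rate)
    return ppv, npv

# The frequentist rates never change below, yet PPV and NPV swing widely.
for base_rate in (0.05, 0.30, 0.50, 0.80):
    ppv, npv = ppv_npv(base_rate)
    print(f"base rate {base_rate:.2f}: PPV = {ppv:.3f}, NPV = {npv:.3f}")
```

The sensitivity and specificity are held fixed in this sketch, yet PPV collapses at low base rates of deception; that base-rate dependence is exactly why the Bayesian figures generalize poorly to field settings with unknown priors.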
Zelicoff commits further methodological errors in suggesting a binary decision scheme that is highly inconsistent with the polygraph decision schemes used in field practice. The data described by Honts (2002) make no attempt to impose such a classification scheme, and instead provide the results of several studies in the ternary scheme that is common to polygraph testing and other sciences. Zelicoff disregards the fact that investigation into the rates and causes of inconclusive test results represents an important and unique line of investigative inquiry, with potential opportunities to reduce both inconclusive results and decision errors. His elimination of inconclusive results amounts to a unilateral decree that an inconclusive test in field practice represents an erroneous decision on the part of the polygraph examiner. Zelicoff includes data from calculations that include an inconclusive zone, but unilaterally rejects their meaning and relevance to actual field practice, and instead erroneously asserts that his uniform decision model with no inconclusive zone somehow more closely approximates field polygraph practices.
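The effect of that decree on reported accuracy is easy to quantify; the counts below are invented purely to show the arithmetic:

```python
# Hypothetical outcome counts for 100 deceptive subjects scored under a
# ternary scheme; the numbers are illustrative only.
correct, errors, inconclusive = 86, 6, 8

# Conventional field reporting: error rate among conclusive results only.
ternary_error_rate = errors / (correct + errors)

# Zelicoff's recoding: every inconclusive result counted as a decision error.
binary_error_rate = (errors + inconclusive) / (correct + errors + inconclusive)

print(f"errors among conclusive results:   {ternary_error_rate:.3f}")  # 0.065
print(f"errors with inconclusives recoded: {binary_error_rate:.3f}")   # 0.140
```

Nothing about the examiners' conclusive decisions changes between the two lines; the apparent error rate more than doubles purely as an artifact of the recoding.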
While Zelicoff inaccurately presents his unimodal, binary Monte Carlo simulation as representative of the results expected in field polygraph situations, actual field polygraph practice represents a signal detection problem in the form of a bimodal test of a Gaussian model (Wickens, 2002; Wickens, 1991). Polygraph tests are not bimodal in the sense that the distribution models themselves are bimodal, but in the sense that the field examiner must compare a test result against two different distribution models: one representing the probability that the test result is included in the expected distribution of truthful scores, and another representing the probability that the test result is included in the distribution of deceptive scores. Some computer scoring algorithms use a single uniform distribution (PolyScore, 1994; CPS, Kircher, 1988, 2002). However, the bimodal model of the Objective Scoring System (Krapohl & McManus, 1999; Krapohl, 2002) is more representative of how field examiners utilize polygraph test data.
The logic of signal detection theory and inferential statistics is intended to determine the probability that an observed test result is included in, or represented by, a known model. Virtually all forms of testing are a process of gathering data and then evaluating how well the data do or do not fit a known model that was derived through the study of data. In the case of polygraph, we have two known models: one for truthful and another for untruthful persons. Decisions, in the form of rejection of a null hypothesis that there is no difference, are made when that probability falls below a specified threshold referred to as alpha. Alpha levels are specified before testing begins, and may be as much decisions of policy as they are decisions of science. Researchers commonly use .05, .01, .001, or .1, depending on how critical their needs are, with corresponding error tolerance rates of 1 in 20, 1 in 100, 1 in 1,000, or 1 in 10. Program managers may specify different alpha boundaries for different purposes (e.g., a prosecutor deciding not to file charges on a murder suspect might require alpha at .01, or even .001, while police applicants might be retained for further evaluation for hire with alpha at .05 or .1). The advantage of more conservative alpha levels (i.e., smaller decimal numbers) is greater confidence in decisions, based on greater confidence of rejection from a particular model. The disadvantage of more rigorous alpha boundaries is an inevitable increase in inconclusive results that occur near the alpha boundary. The arbitrary assignment of inconclusive results as errors, as Zelicoff does, can be accomplished only by neglecting the specified alpha boundary for decisions pertaining to the alternative category.
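That trade-off between decision confidence and inconclusive rate can be illustrated with a short simulation; the truthful and deceptive score models below are hypothetical stand-ins (unit-variance Gaussians separated by 2.5 standard deviations), not parameters from any published polygraph algorithm:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical Gaussian score models for truthful and deceptive subjects.
TRUTHFUL = norm(loc=0.0, scale=1.0)
DECEPTIVE = norm(loc=2.5, scale=1.0)

def inconclusive_rate(alpha: float, n: int = 100_000, seed: int = 1) -> float:
    """Fraction of truthful subjects left inconclusive at a given alpha."""
    rng = np.random.default_rng(seed)
    scores = rng.normal(0.0, 1.0, size=n)   # simulated truthful examinees
    # A conclusive truthful call requires rejecting the deceptive model:
    # the score must be improbably low under the deceptive distribution.
    reject_deceptive = DECEPTIVE.cdf(scores) < alpha
    # A conclusive deceptive call requires rejecting the truthful model:
    # the score must be improbably high under the truthful distribution.
    reject_truthful = TRUTHFUL.sf(scores) < alpha
    inconclusive = ~reject_deceptive & ~reject_truthful
    return inconclusive.mean()

for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha:<5}: inconclusive rate = {inconclusive_rate(alpha):.3f}")
```

With these invented models, tightening alpha from .05 to .001 sharply increases the share of truthful examinees whose scores cannot be conclusively resolved, which is precisely the inevitable increase near the boundary described above.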
A test result that occurs outside the specified alpha boundary cannot automatically be assumed to be within, or outside, the normal range or specified alpha boundary of the alternate distribution model. If the two probability models are reasonably separated, the major portion of all possible result values will occur within the normal range of one probability model and outside the specified alpha boundary of the other. Depending on the location and shape of the two probability distributions, and the specified decision alpha, it is possible that a smaller number of possible result values may exist outside the alpha boundaries of both probability models. It would be impossible for a researcher or field examiner to assign a result from that region to either distribution group by any means other than guessing; yet Zelicoff feels justified in doing exactly that. In the case that the two probability models are insufficiently separated, it is conceivable that a small number of test results may exceed the specified decision boundary for both models, depending on the specified alpha. Again, it would be impossible for a researcher or field examiner to assign such test results to either the truthful or deceptive category by any means other than arbitrary or random assignment. Despite Zelicoff's assertions, it is unlikely that we will ever observe normative distributions so convenient that the desired alpha boundaries abut one another with perfect uniformity and no inconclusive zone. Statisticians, researchers, and field polygraph examiners do not guess or make arbitrary conclusions when the data themselves do not lead to a single mathematically driven classification, as Zelicoff has suggested. When the data are inconclusive, the test result is inconclusive.
While the notion that a result may be statistically rejected from membership in either probability model may be bothersome to those who desire simplistic models and assertive answers, whether supported or not, it is important to note that test results occur in the form of probability statements, and are not themselves material substances. Probability statements become simplified into terms such as “positive,” “negative,” and “inconclusive.” The terms “positive” and “negative” do not represent value judgments within the testing context, but simply indicate whether the investigator found evidence of, or identified, the phenomenon of interest. Value judgments are applied later, situationally. A positive result might be a form of bad news, as when a biopsy is positive for cancerous cells, or good news, as when a patient is hoping for a positive result from a pregnancy test.
Zelicoff's decision to regard inconclusive results as decision errors, under the vague term “errors of ambiguity,” represents a substantial departure from, and inconsistency with, not only field polygraph testing but the well-established principles of Gaussian models in signal detection theory. The result of Zelicoff's decision to neglect the bimodal nature of the polygraph test is a distortion of the alpha decision boundary, and conclusions that cannot be argued to be representative of what would occur in polygraph field practice.
Test results are the mathematical transformation of measurement data into probability values referred to as p-values. When a null hypothesis states that there is no difference between a known distribution and an observed test result, that null hypothesis can be rejected when the observed p-value is less than the specified alpha. The term “p-value” refers to the probability estimate of how well a result fits a known model. It is common practice to start with the null hypothesis that there is no difference between an observed result and the known models of either truthful or untruthful persons. Investigators then look for justification, based on measurements and statistical significance, to reject each of the null hypotheses. Barland (1985) first described the application of a model for polygraph error estimation that closely resembles the Gaussian model (Wickens, 1991; Wickens, 2002). In that model, an observed test result leads to a nondeceptive classification when it is observed to be a poor fit (i.e., the observed p-value is less than the specified alpha/decision threshold) when compared to the distribution of scores for a known model of untruthful persons, and occurs within the normal range of a known model of truthful persons. Likewise, if an observed test result is a poor fit for a known model of truthful persons, and within the normal range of a known model of untruthful persons, then we conclude the person most likely comes from the population of untruthful persons, as represented by the normative data. According to this statistical decision model, assignment to either group requires an observed p-value that is less than the specified decision alpha. Values that are not less than the specified alpha for rejection from one probability distribution cannot be arbitrarily assigned to the other distribution. Though it in no way reflects field practices, Zelicoff has suggested estimating polygraph accuracy rates through a model that is inconsistent with the basic Gaussian model in signal detection theory. Zelicoff's assignment of inconclusive results to the alternate distribution, without regard to the observed level of significance, is both negligent and misleading. Under those circumstances, we can expect to see certain changes in the rates of observed decision errors.
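A minimal sketch of that two-model decision logic, assuming hypothetical unit-variance Gaussian norms and one-tailed p-values (both are assumptions for illustration, not Barland's published parameters):

```python
from scipy.stats import norm

# Hypothetical normative models; in practice these come from normative data.
TRUTHFUL = norm(loc=0.0, scale=1.0)    # distribution of truthful scores
DECEPTIVE = norm(loc=2.5, scale=1.0)   # distribution of deceptive scores
ALPHA = 0.05

def classify(score: float) -> str:
    """Ternary classification against two Gaussian models.

    A score must be rejected from one model (p < ALPHA) while remaining
    within the normal range of the other before a conclusive call is made.
    """
    # One-tailed p-values: how plausible is the score under each model?
    p_deceptive = DECEPTIVE.cdf(score)  # small when far below deceptive mean
    p_truthful = TRUTHFUL.sf(score)     # small when far above truthful mean

    fits_truthful = p_truthful >= ALPHA
    fits_deceptive = p_deceptive >= ALPHA

    if p_deceptive < ALPHA and fits_truthful:
        return "non-deceptive"
    if p_truthful < ALPHA and fits_deceptive:
        return "deceptive"
    return "inconclusive"   # the data do not support a conclusive call

for score in (-0.5, 1.2, 3.0):
    print(f"score {score:+.1f}: {classify(score)}")
```

Note that the middle score is rejected from neither model, so no mathematically driven classification exists for it; Zelicoff's scheme would force it into one category anyway.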
The result of Zelicoff's methodological errors would be a predictable increase in polygraph decision errors for result values that occur near the alpha boundaries. This becomes especially evident when considering an example: with alpha for deceptive classifications set at .05, Zelicoff would have us regard a test result with a p-value of .06 as a conclusive non-deceptive result. Clearly, this would be a reckless suggestion in field practice, and it would not occur. The practical and methodologically appropriate solution is the one already in place, in which data that fail to meet the specified threshold for significance cannot be regarded as a basis for a conclusion and must be regarded as inconclusive.
r
------------------
"Gentlemen, you can't fight in here. This is the war room."
--(Stanley Kubrick/Peter Sellers - Dr. Strangelove, 1964)